
SDF-based dataset support#32

Open
sfluegel05 wants to merge 27 commits into dev from feature/sdf-support

Conversation

@sfluegel05
Contributor

Add support for the SDF-based dataset (ChEB-AI/python-chebai#147). This mostly includes:

  • allowing Chem.Mol objects as input to _read_data functions
  • allowing Chem.Mol as input to read_property (also: the _read_data function now returns the augmented molecule dictionary which gets passed to read_property, avoiding a complete recalculation for each property)
  • using a standard sanitize_molecule function to ensure consistent SMILES parsing
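As a rough sketch of the `str | Chem.Mol` dispatch described above (plain Python; `Mol` and `mol_from_smiles` are stand-ins for `rdkit.Chem.Mol` and a SMILES parser, not the PR's actual API):

```python
# Sketch only: `Mol` stands in for rdkit.Chem.Mol, and `mol_from_smiles`
# for a SMILES parser such as Chem.MolFromSmiles plus sanitization.

class Mol:
    """Stand-in for an already-parsed RDKit molecule (e.g. read from an SDF)."""
    def __init__(self, smiles):
        self.smiles = smiles

def mol_from_smiles(smiles):
    """Placeholder parser; a real version would parse and sanitize the SMILES."""
    if not smiles:
        raise ValueError("empty SMILES")
    return Mol(smiles)

def read_data(raw_data):
    """Accept either a SMILES string or a pre-built molecule object."""
    try:
        # strings are parsed; molecule objects are passed through unchanged
        mol = mol_from_smiles(raw_data) if isinstance(raw_data, str) else raw_data
    except ValueError:
        return None  # unparsable input is skipped, not fatal
    return mol
```

With this shape, an SDF-based dataset can hand already-parsed molecules straight to the reader while SMILES-based datasets keep working unchanged.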

One issue I came across while testing this: The AugAtomNumHs property (or any property inheriting from FrozenPropertyAlias) doesn't allow new tokens to be created. @aditya0by0 How should I add new tokens here? Should I first build a dataset with the non-augmented version of each property and then use those to create an augmented dataset?

The above exception was the direct cause of the following exception:

    return self.compute_fn(*args)
  File "/home/staff/s/sifluegel/python-chebai/chebai/cli.py", line 46, in call_data_methods
    data.setup()
  File "/home/staff/s/sifluegel/python-chebai/chebai/preprocessing/datasets/base.py", line 524, in setup
    self._after_setup(**kwargs)
  File "/home/staff/s/sifluegel/python-chebai-graph/chebai_graph/preprocessing/datasets/chebi.py", line 199, in _after_setup
    self._setup_properties()
  File "/home/staff/s/sifluegel/python-chebai-graph/chebai_graph/preprocessing/datasets/chebi.py", line 168, in _setup_properties
    property.on_finish()
  File "/home/staff/s/sifluegel/python-chebai-graph/chebai_graph/preprocessing/properties/base.py", line 206, in on_finish
    raise ValueError(
ValueError: AugAtomNumHs attempted to add new tokens to a frozen encoder at /home/staff/s/sifluegel/python-chebai-graph/chebai_graph/preprocessing/bin/AtomNumHs/indices_one_hot.txt

@aditya0by0
Member

aditya0by0 commented Feb 26, 2026

One issue I came across while testing this: The AugAtomNumHs property (or any property inheriting from FrozenPropertyAlias) doesn't allow new tokens to be created. @aditya0by0 How should I add new tokens here?

Yes, the approach below is the intended/correct one. FrozenPropertyAlias was designed to reuse the non-augmented versions of properties when working with augmented data. Because of this design, alias properties are not allowed to create new tokens; they can only use tokens that already exist. This restriction is intentionally enforced in the logic of FrozenPropertyAlias.

Should I first build a dataset with the non-augmented version of each property and then use those to create an augmented dataset?

class FrozenPropertyAlias(MolecularProperty, ABC):
    """
    Wrapper base class for augmented graph properties that reuse existing molecular properties.

    This allows an augmented property class (with an 'Aug' prefix in its name) to:
    - Reuse the encoder and index files of the base property by removing the 'Aug' prefix from its name.
    - Prevent adding new tokens to the encoder cache by freezing it (using MappingProxyType).

    Usage:
        Inherit from FrozenPropertyAlias and the desired base molecular property class,
        and name the class with an 'Aug' prefix (e.g., 'AugAtomType').

    Example:
        ```python
        class AugAtomType(FrozenPropertyAlias, AtomType): ...
        ```
    """
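The freezing mechanism the docstring mentions can be illustrated in a few lines of plain Python (a minimal sketch, not the package's actual classes): wrapping the base property's token cache in `MappingProxyType` keeps lookups of known tokens working while any attempt to add an unseen token raises.

```python
from types import MappingProxyType

base_cache = {"0": 0, "1": 1, "2": 2}        # token -> index of the base property
frozen_cache = MappingProxyType(base_cache)  # read-only view used by the alias

def encode(token, cache):
    """Return the index for `token`, adding it to the cache if unseen."""
    if token not in cache:
        cache[token] = len(cache)  # raises TypeError on a read-only proxy
    return cache[token]
```

This is consistent with the ValueError in the traceback above: the augmented property encountered a token that was never recorded in the base property's index file, and the frozen cache refused to grow.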

@sfluegel05
Contributor Author

I have created a regular dataset which has updated the tokens.txt files. I also changed the edge_dim parameter of the model accordingly. Now I am able to create an augmented dataset.

However, I get an error when trying to load the data. _merge_props_into_base fails with this error:

ValueError: Error assigning property 'AtomNodeLevel' values to node features: The expanded size of the tensor (24) must match the existing size (26) at non-singleton dimension 0.  Target sizes: [24, 3].  Tensor sizes: [26, 3]
Property values shape: torch.Size([26, 3]), expected (num_nodes, 3)
Node feature matrix shape: torch.Size([24, 203])

when trying to execute this statement (try/except block added for debugging):

if isinstance(property, AllNodeTypeProperty):
    try:
        x[:, atom_offset : atom_offset + enc_len] = property_values
    except Exception as e:
        raise ValueError(
            f"Error assigning property '{property.name}' values to node features: {e}\n"
            f"Property values shape: {property_values.shape}, expected (num_nodes, {enc_len})\n"
            f"Node feature matrix shape: {x.shape}"
        )

It looks like the number of nodes in the dataset (here: 24) is not the same as the number of nodes for which AtomNodeLevel provides values (here: 26). I tried re-generating the dataset but still get the same results. @aditya0by0 Do you have an idea why this could happen?
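For context, the failing assignment reduces to a simple invariant: the property must provide exactly one row per node in the feature matrix. A plain-Python sketch of that check (illustrative helper, not the project's code):

```python
# Mirrors x[:, atom_offset : atom_offset + enc_len] = property_values:
# the assignment only works when both sides agree on the number of nodes.

def assign_property(x, property_values, atom_offset, enc_len):
    """Write per-node property rows into the node feature matrix `x`."""
    num_nodes = len(x)
    if len(property_values) != num_nodes:
        raise ValueError(
            f"Property provides values for {len(property_values)} nodes, "
            f"but the feature matrix has {num_nodes} nodes."
        )
    for i, row in enumerate(property_values):
        x[i][atom_offset:atom_offset + enc_len] = row
    return x
```

In the error above the two sides disagree (24 nodes in the feature matrix vs. 26 property rows), so the assignment fails before any values are written.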

@sfluegel05 sfluegel05 marked this pull request as ready for review March 6, 2026 14:37
@sfluegel05
Contributor Author

I also changed the linting to use only Ruff (mirroring the main repository: ChEB-AI/python-chebai#145).


Copilot AI left a comment


Pull request overview

Adds SDF-based dataset support across the graph preprocessing pipeline by accepting RDKit Chem.Mol inputs, reusing augmented molecule structures for property computation, and standardizing molecule sanitization.

Changes:

  • Extend readers/datasets to accept str | Chem.Mol inputs and propagate augmented molecule dictionaries to read_property.
  • Standardize RDKit SMILES parsing via sanitize=False + shared _sanitize_molecule.
  • Update property token bins / model config edge dimensions and migrate linting to pre-commit + Ruff.

Reviewed changes

Copilot reviewed 15 out of 19 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| configs/model/res_aug_amgpool.yml | Updates edge_dim to reflect additional bond feature(s). |
| configs/model/res_aug_aapool.yml | Updates edge_dim to reflect additional bond feature(s). |
| configs/model/gat_aug_amgpool.yml | Updates edge_dim to reflect additional bond feature(s). |
| configs/model/gat_aug_aapool.yml | Updates edge_dim to reflect additional bond feature(s). |
| chebai_graph/preprocessing/reader/reader.py | Adds RDKit-mol support and sanitization to readers; changes _read_data return shape for property reader. |
| chebai_graph/preprocessing/reader/augmented_reader.py | Allows Chem.Mol inputs and returns augmented molecule dict alongside GeomData; updates property reading to accept dict. |
| chebai_graph/preprocessing/property_encoder.py | Makes IndexEncoder.cache robustly mutable when adding new tokens. |
| chebai_graph/preprocessing/properties/properties.py | Adds validation around mol and SMILES conversion before descriptor generation. |
| chebai_graph/preprocessing/properties/base.py | Formatting-only assertion message restructuring. |
| chebai_graph/preprocessing/fg_detection/fg_aware_rule_based.py | Uses shared sanitization when rebuilding fragment mols from SMILES. |
| chebai_graph/preprocessing/datasets/chebi.py | Avoids redundant augmentation/property recomputation; supports tuple-returning _read_data for downstream property reading. |
| chebai_graph/preprocessing/bin/NumAtomBonds/indices_one_hot.txt | Extends token set for NumAtomBonds. |
| chebai_graph/preprocessing/bin/BondType/indices_one_hot.txt | Extends token set for BondType (adds UNSPECIFIED). |
| chebai_graph/preprocessing/bin/AtomNumHs/indices_one_hot.txt | Extends token set for AtomNumHs. |
| chebai_graph/preprocessing/bin/AtomFunctionalGroup/indices_one_hot.txt | Extends token set for AtomFunctionalGroup. |
| chebai_graph/models/dynamic_gni.py | Formatting-only assertion message restructuring. |
| .pre-commit-config.yaml | Migrates lint/format hooks to Ruff + updates hook set. |
| .github/workflows/pre-commit.yml | Adds CI workflow to run pre-commit. |
| .github/workflows/lint.yml | Removes prior standalone lint workflow. |
Comments suppressed due to low confidence (1)

chebai_graph/preprocessing/reader/reader.py:170

  • GraphReader._read_data now calls _smiles_to_mol() which returns an RDKit Chem.Mol, but the method still asserts isinstance(mol, nx.Graph) and then iterates mol.nodes/mol.edges. This will fail at runtime for every input. Either restore the previous pysmiles/NetworkX parsing path, or refactor _read_data to build the NetworkX graph from the RDKit molecule (and adjust types/docs accordingly).
        # raw_data is a SMILES string
        try:
            mol = (
                self._smiles_to_mol(raw_data) if isinstance(raw_data, str) else raw_data
            )
        except ValueError:
            return None
        assert isinstance(mol, nx.Graph)
        d: dict[int, int] = {}


Comment on lines +135 to 147
# augment molecule graph if possible (this would also happen for the properties if needed, but this avoids redundancy)
if isinstance(self.reader, _AugmentorReader):
    returned_results = []
    for mol in features:
        try:
            r = self.reader._create_augmented_graph(mol)
        except Exception:
            r = None
        returned_results.append(r)
    mols = [
        augmented_mol[1] if augmented_mol is not None else None
        for augmented_mol in returned_results
    ]

Copilot AI Mar 8, 2026


In _setup_properties, when self.reader is an _AugmentorReader you call self.reader._create_augmented_graph(mol) for each entry in features. For existing SMILES-based datasets features is typically a str, but _create_augmented_graph expects an RDKit Chem.Mol and will fail (caught and turned into None), resulting in mols being all None and property extraction being skipped. Consider using self.reader.read_property(...) directly, or convert SMILES to Chem.Mol via the reader’s _smiles_to_mol before calling _create_augmented_graph.

Comment on lines +310 to +315
if isinstance(row["features"], tuple):
    # ignore additional returned data from _read_data (e.g. augmented molecule dict)
    geom_data, _ = row["features"]
else:
    geom_data = row["features"]

Copilot AI Mar 8, 2026


GraphPropertiesMixIn._merge_props_into_base now supports row["features"] being a tuple, but other mixins in this same module (e.g., the augmented-graph property mixins) still treat row["features"] as a GeomData and read masks from it. If the processed dataset stores (GeomData, augmented_molecule) (as _AugmentorReader._read_data now returns), those asserts/mask accesses will break at load time. Consider normalizing row["features"] to GeomData earlier (or updating the remaining call sites to unwrap tuples consistently).
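One way to implement the normalization suggested here (hedged sketch; `GeomData` is a stand-in class and `normalize_features` a hypothetical helper name, not the project's API):

```python
# Unwrap tuple-shaped row["features"] once, up front, so every downstream
# mixin always sees the bare graph-data object.

class GeomData:
    """Stand-in for the real graph data type (e.g. torch_geometric Data)."""
    pass

def normalize_features(row):
    """Return `row` with row['features'] reduced to the bare GeomData object."""
    feats = row.get("features")
    if isinstance(feats, tuple):
        # (GeomData, augmented_molecule) -> keep the graph, stash the molecule
        row["features"], row["augmented_mol"] = feats
    return row
```

Applying this once when rows are loaded would let the remaining mixins keep their existing `row["features"]` accesses unchanged.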

Comment on lines +852 to +857


if __name__ == "__main__":
    dataset = ChEBI25_WFGE_WGN_AsPerNodeType(chebi_version=248, subset="3_STAR")
    dataset.prepare_data()
    dataset.setup()

Copilot AI Mar 8, 2026


This file appears to include an executable __main__ block that runs dataset preparation/setup. For a library/module file this is easy to trigger accidentally and can lead to long-running side effects when users run the module directly. Consider removing it from the PR, or moving this into a dedicated CLI/script (or a reproducible notebook) if it’s meant as a debugging entrypoint.

Suggested change (delete the block):

    if __name__ == "__main__":
        dataset = ChEBI25_WFGE_WGN_AsPerNodeType(chebi_version=248, subset="3_STAR")
        dataset.prepare_data()
        dataset.setup()

Comment on lines +12 to +19
- uses: actions/checkout@v4

- uses: actions/setup-python@v5
  with:
    python-version: '3.10'

- name: Run pre-commit
  uses: pre-commit/action@v3.0.1

Copilot AI Mar 8, 2026


The workflow YAML is invalid due to indentation: steps: must contain a properly-indented list (the - uses: entries should be indented under steps:). As written, GitHub Actions will fail to parse this workflow.

Suggested change (indent the step entries under `steps:`):

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Run pre-commit
        uses: pre-commit/action@v3.0.1

hooks:
  - id: black
  - id: black-jupyter # for formatting jupyter-notebook
# Use `pre-commit autoupdate` to update all the hook.

Copilot AI Mar 8, 2026


Minor grammar: "update all the hook" should be "update all the hooks".

Suggested change
# Use `pre-commit autoupdate` to update all the hook.
# Use `pre-commit autoupdate` to update all the hooks.

raw_data (str | Chem.Mol): SMILES string or RDKit molecule object.

Returns:
GeomData | None: Graph data object or None if molecule parsing failed.

Copilot AI Mar 8, 2026


The _read_data docstring still documents returning GeomData | None, but the function signature and implementation now return a tuple (GeomData, mol). Please update the docstring return type/description to match the new behavior to avoid misleading callers.

Suggested change
GeomData | None: Graph data object or None if molecule parsing failed.
tuple[GeomData, Chem.Mol] | None: A tuple of (graph data object, RDKit molecule)
or None if molecule parsing failed.

Comment on lines +1917 to 1921
try:
    frag_mol = _sanitize_molecule(frag_mol)
except Exception:
    pass
frag_rings = frag_mol.GetRingInfo().AtomRings()

Copilot AI Mar 8, 2026


frag_mol can be None when Chem.MolFromSmiles(frag, sanitize=False) fails, or it can become None if sanitization fails. In both cases the next line calls frag_mol.GetRingInfo(), which will raise an AttributeError. Add a None check (and decide whether to mark is_ring_fg as False or skip the fragment) before accessing ring info.

Suggested change

Before:

    try:
        frag_mol = _sanitize_molecule(frag_mol)
    except Exception:
        pass
    frag_rings = frag_mol.GetRingInfo().AtomRings()

After:

    if frag_mol is not None:
        try:
            frag_mol = _sanitize_molecule(frag_mol)
        except Exception:
            # If sanitization fails, fall back to the original fragment molecule
            pass
    # If molecule creation or sanitization failed, treat as having no rings
    if frag_mol is None:
        frag_rings = ()
    else:
        try:
            frag_rings = frag_mol.GetRingInfo().AtomRings()
        except Exception:
            frag_rings = ()
Comment on lines +222 to 225
if isinstance(result["features"], tuple):
    result["features"], raw_data = result["features"]
if result is None or result["features"] is None:
    return None

Copilot AI Mar 8, 2026


_preprocess_smiles_for_pred checks isinstance(result["features"], tuple) before verifying that result is not None. Since the next line explicitly handles result is None, this can currently raise a TypeError/AttributeError when to_data(...) returns None. Move the result is None / result["features"] is None guard before dereferencing result["features"].

Suggested change

Before:

    if isinstance(result["features"], tuple):
        result["features"], raw_data = result["features"]
    if result is None or result["features"] is None:
        return None

After:

    if result is None or result["features"] is None:
        return None
    if isinstance(result["features"], tuple):
        result["features"], raw_data = result["features"]
